Abstract: Probabilistic graphical models are a fundamental
tool in statistics, machine learning, signal processing, and
control. When such a model is defined on a directed acyclic
graph (DAG), one can assign a partial ordering to the events 
occurring in the corresponding stochastic system. Based on 
the work of Judea Pearl and others, these DAG-based "causal 
factorizations" of joint probability measures have been used 
for characterization and inference of functional dependencies 
(causal links). This mostly expository paper focuses on several 
connections between Pearl's formalism (and in particular his 
notion of "intervention") and information-theoretic notions of 
causality and feedback (such as causal conditioning, directed 
stochastic kernels, and directed information). As an application, 
we show how conditional directed information can be used 
to develop an information-theoretic version of Pearl's "back-door"
criterion for identifiability of causal effects from passive
observations. This suggests that the back-door criterion can be 
thought of as a causal analog of statistical sufficiency. 

I. Introduction 

The problems of causality in engineered and natural 
systems have recently attracted the attention of information 
theorists and signal processing researchers [1]-[6]. The well-worn
but nonetheless true maxim stating that "correlation
does not imply causation" means that causal relationships 
cannot be captured by standard information-theoretic quan- 
tities like mutual information, conditional entropy or di- 
vergence, because all of these are measures of statistical 
dependence (i.e., correlation). The first information-theoretic 
studies of causality were concerned with feedback commu- 
nication systems and led to the development of the notion 
of directed information by Massey [7], with subsequent 
extensions and generalizations by Kramer, Tatikonda, and 
Mitter [8]-[10]. Connections between directed information
and sequential prediction, source coding, and hypothesis
testing have also been extensively investigated [11]-[14].

However, causality has also been the subject of vigorous 
study in the statistics, artificial intelligence, and machine 
learning communities [15]-[18]. The key idea advanced in
these works, particularly by Pearl, is that causality is synonymous
with functional (rather than statistical) dependence.
In other words, causal relationships correspond to stable 
deterministic mechanisms, by which one set of variables 
(the causes), together with some possibly unobserved ex- 
ogenous disturbances, may affect another set of variables 
(the effects). Thus, inferring causal relationships requires 
active experimentation that intervenes into some of these 
mechanisms. In very schematic terms (this discussion will be
made precise in the sequel), an ideal setting for identifying
or estimating the "causal effect" of one observable (say,
$X$) on another (say, $Y$) would permit the experimenter to
disconnect $X$ from all mechanisms that influence it, force $X$
to take on some value(s) of interest, and then estimate the
probability distribution of $Y$ as a result of this intervention,
while controlling for all possible spurious influences and
factors. This is quite different from estimating the statistical
effect of $X$ on $Y$, i.e., the conditional distribution $P_{Y|X}$, by
means of passive observations, e.g., from a large number of
independent samples from the joint distribution of $X, Y$.

This work was supported by NSF grant CCF-1017564 and by AFOSR
grant FA9550-10-1-0390.

The author is with the Department of Electrical and Computer Engineering,
Duke University, Durham, NC. E-mail: m.raginsky@duke.edu.

The purpose of this mostly expository paper is to intro- 
duce the information theory, control, and signal processing 
communities to several key concepts of the probabilistic 
theory of causality and, along the way, to elucidate several 
connections between Pearl's treatment of interventions on 
the one hand, and information-theoretic concepts pertaining 
to causality (such as directed information [7], causal condi- 
tioning [8], or directed stochastic kernels [9], [10]) on the 
other. In particular, the representation of causal relationships 
by Markov factorizations of joint probability distributions 
w.r.t. directed acyclic graphs (DAGs) [15]-[18], such that
the natural partial ordering of the vertices of the DAG 
corresponds to causal ordering of the events in the system 
under consideration, should be very congenial to systems 
theorists, who naturally think in terms of block diagrams, 
interconnections, and sequential recursive models. 

Let us give a brief overview of the remainder of the 
paper. We first motivate the functional view of causality 
in Section II by means of a simple example of a point-to-point
communication system. Next, in Section III, we
develop the general framework for studying causality in
Markovian dynamical systems. In particular, we motivate 
Pearl's definition of intervention as "surgery" on a sequential 
recursive representation of such a system, whereby the 
relations defining the intervened-upon variables are deleted, 
and all instances of these variables in the remaining relations 
are assigned to some fixed value. This operation has a 
natural diagrammatic representation on the DAG inducing 
the Markov factorization of the joint probability distribution 
of the system observables according to the sequential model. 
We also show that the probability distributions induced by 
this operation (i.e., what Pearl calls the causal effects) are in 
one-to-one correspondence with the directed stochastic ker- 
nels of Tatikonda and Mitter [9], [10]. This correspondence 
is then used in Section IV to show how directed information
(and certain generalizations, such as conditional directed
information) can be used to quantify the strength of causal
effects by comparing them with ordinary (observational)
conditional distributions. Section V develops an information-theoretic
interpretation of Pearl's "back-door" criterion [18,
Sec. 3.3.1] (a sufficient condition for identifiability of causal
effects from observational data) in terms of conditional
directed information, showing in effect that the back-door
criterion can be viewed as a natural causal analog of statistical
sufficiency.

Fig. 1. A generic communication system without feedback.

II. Revealing causality through functional dependence

To illustrate the difference between statistical dependence
and causal dependence, consider the standard diagram of a
point-to-point communication system without feedback, as
shown in Figure 1. A message $W$ is mapped into a channel
input symbol $X = e(W)$, $X$ is transmitted over a channel
with transition kernel $P_{Y|X}$, and the resulting channel output
symbol $Y$ is processed at the receiver into a decoded message
$\hat{W} = d(Y)$, where $e$ and $d$ are some deterministic encoding
and decoding functions.

It is intuitively clear that the message $W$ "causes" the
decoded message $\hat{W}$ and not the other way around, but we
cannot tell this from the joint distribution of $W$, $X$, $Y$, and
$\hat{W}$. Indeed, we have

$$P_{WXY\hat{W}}(w, x, y, \hat{w}) = P_W(w)\,\mathbf{1}\{e(w) = x\}\,P_{Y|X}(y|x)\,\mathbf{1}\{d(y) = \hat{w}\},$$

so that the joint distribution of $W$ and $\hat{W}$, given by

$$\begin{aligned}
P_{W\hat{W}}(w, \hat{w}) &= P_W(w) \sum_{x,y} \mathbf{1}\{e(w) = x\}\,P_{Y|X}(y|x)\,\mathbf{1}\{d(y) = \hat{w}\} \\
&= P_W(w) \sum_{y} P_{Y|X}(y|e(w))\,\mathbf{1}\{d(y) = \hat{w}\} \\
&= P_W(w)\,P_{\hat{W}|W}(\hat{w}|w),
\end{aligned}$$

can also be factored as $P_{W\hat{W}}(w, \hat{w}) = P_{\hat{W}}(\hat{w})\,P_{W|\hat{W}}(w|\hat{w})$,
which merely shows that $W$ and $\hat{W}$ are statistically dependent
on one another. Indeed, to quote Massey [7], "statistical
dependence, unlike causality, has no inherent directivity." If
the encoder, the channel, and the decoder are nondegenerate,
so that $I(W; \hat{W}) > 0$, then the dependence between the
message $W$ and the decoded message $\hat{W}$ is completely
symmetric: $\hat{W}$ depends on $W$, and $W$ depends on $\hat{W}$.

In order to elicit the causal influence of the transmitted 
message on the decoded message, as well as the lack of 
causal influence in the opposite direction, we need to break 
this symmetry. To that end, let us represent the stochastic
transformation $X \to Y$ effected by the channel $P_{Y|X}$ as a
deterministic mapping $Y = f(X, U)$, where $U$ is random
channel noise, assumed to be independent of $W$ and $X$.
(Indeed, any stochastic kernel $P_{Y|X}$ can be represented
in this form for a suitable choice of $f$ and $P_U$.) This
representation is shown in Figure 2.

Fig. 2. An equivalent diagram of the system in Figure 1.

Now we can represent our communication system in the 
following sequential form: 

$$\begin{aligned}
W &\sim P_W \\
U &\sim P_U \\
X &= e(W) \\
Y &= f(X, U) \\
\hat{W} &= d(Y)
\end{aligned} \tag{1}$$

What happens if we make a hard assignment $W \leftarrow w$ of a
specific value $w$ to the transmitted message? Looking at the
sequential model in (1), we see that this action will influence
the "downstream" variables $U, X, Y, \hat{W}$ as follows:

$$\begin{aligned}
U &\sim P_U \\
X &= e(w) \\
Y &= f(e(w), U) \\
\hat{W} &= d(f(e(w), U)).
\end{aligned}$$

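The corresponding joint distribution of $U$, $X$, $Y$, and $\hat{W}$
resulting from the action $W \leftarrow w$, which we will denote
by $P_{UXY\hat{W}|W \leftarrow w}$, has the form

$$P_{UXY\hat{W}|W \leftarrow w}(u, x, y, \hat{w}) = P_U(u)\,\mathbf{1}\{e(w) = x\}\,\mathbf{1}\{f(e(w), u) = y\}\,\mathbf{1}\{d(f(e(w), u)) = \hat{w}\}.$$

Marginalizing out the channel noise $U$, the channel input $X$,
and the channel output $Y$, we get

$$P_{\hat{W}|W \leftarrow w}(\hat{w}) = \sum_{u, y} P_U(u)\,\mathbf{1}\{f(e(w), u) = y\}\,\mathbf{1}\{d(f(e(w), u)) = \hat{w}\}.$$

This distribution is, in fact, equal to the ordinary conditional
distribution $P_{\hat{W}|W = w}$, given by

$$\begin{aligned}
P_{\hat{W}|W = w}(\hat{w}) &= \sum_{y} P_{Y|X}(y|e(w))\,\mathbf{1}\{d(y) = \hat{w}\} \\
&= \sum_{u, y} P_U(u)\,\mathbf{1}\{f(e(w), u) = y\}\,\mathbf{1}\{d(f(e(w), u)) = \hat{w}\}.
\end{aligned}$$

Again, assuming that the mappings $e$, $f$, $d$ are nondegenerate,
there exist at least two values $w, w'$ for the transmitted message
for which $P_{\hat{W}|W = w} \neq P_{\hat{W}|W = w'}$ and, consequently,
$P_{\hat{W}|W \leftarrow w} \neq P_{\hat{W}|W \leftarrow w'}$. In other words, the downstream
effect of the hard assignment $W \leftarrow w$ is different from that
of $W \leftarrow w'$.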

Fig. 3. A generic stochastic dynamical system with multiple feedback
loops and exogenous disturbances.

Now let us consider what happens if we make a hard
assignment $\hat{W} \leftarrow \hat{w}$ of the decoded message. One way to do
this would be to replace the original decoding map $d$ with
the constant map $d_{\hat{w}}(y) = \hat{w}$ for all $y$. The effect of this hard
assignment on the remaining variables can be represented as

$$\begin{aligned}
W &\sim P_W \\
U &\sim P_U \\
X &= e(W) \\
Y &= f(e(W), U)
\end{aligned}$$

This clearly shows that the joint distribution of the "upstream"
random variables $W, U, X, Y$ is unaffected by the
action $\hat{W} \leftarrow \hat{w}$; in fact, exactly the same conclusion would
hold if we replaced the original decoding map $d$ with any
other decoding map $d'$. In other words,



$$P_{WUXY|\hat{W} \leftarrow \hat{w}} = P_{WUXY} \quad \Longrightarrow \quad P_{W|\hat{W} \leftarrow \hat{w}} = P_W,$$

which shows the absence of causal influence of $\hat{W}$ on $W$.
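The asymmetry between the two hard assignments can be checked by direct enumeration. The sketch below is only an illustration under assumed ingredients that do not appear in the paper: a binary message, identity encoder and decoder, and a binary symmetric channel written as Y = X xor U with crossover probability 0.1. The function names `do_W` and `do_W_hat` are hypothetical.

```python
from itertools import product

# Hypothetical toy instance of the sequential model (1).
P_W = {0: 0.5, 1: 0.5}   # message distribution (assumed)
P_U = {0: 0.9, 1: 0.1}   # channel noise (assumed crossover probability 0.1)

def e(w):      # encoder (assumed identity)
    return w

def f(x, u):   # channel as a deterministic map of input and noise
    return x ^ u

def d(y):      # decoder (assumed identity)
    return y

def do_W(w):
    """Distribution of W_hat after the hard assignment W <- w."""
    dist = {0: 0.0, 1: 0.0}
    for u, pu in P_U.items():
        dist[d(f(e(w), u))] += pu
    return dist

def do_W_hat(w_hat):
    """Distribution of W after replacing d by the constant map y -> w_hat.

    The upstream equations for W, U, X, Y are untouched, so w_hat never
    enters the computation of the distribution of W.
    """
    dist = {0: 0.0, 1: 0.0}
    for (w, pw), (u, pu) in product(P_W.items(), P_U.items()):
        dist[w] += pw * pu
    return dist

print(do_W(0), do_W(1))          # two different downstream distributions
print(do_W_hat(0), do_W_hat(1))  # both coincide with P_W
```

In this toy instance the assignment W <- w changes the law of the decoded message, while forcing the decoded message leaves the law of W equal to P_W for every assigned value.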

III. Causality in sequential dynamical systems 

The simple example of the preceding section illustrates
the general treatment of causality advocated by Pearl. To
motivate it, let us consider a stochastic dynamical system
with multiple feedback loops and exogenous influences (or
disturbances) shown in Figure 3. The exogenous disturbances
are modeled by $n$ random variables $U_1, \ldots, U_n$ with a fixed
joint distribution $P_{U^n} = P_{U_1 \ldots U_n}$, while the system observables
are represented by $n$ variables $X_1, \ldots, X_n$, related to
$U^n$ and to one another by $n$ coupled equations

$$X_i = f_i(X^n, U^n), \qquad i \in [n]. \tag{2}$$

We assume that the system specification is sound in the sense
that the equations (2) have a unique solution $X^n = x^n$ for
any realization $U^n = u^n$ of the exogenous variables. This
representation of stochastic dynamical systems as multiple
feedback loops was used by Witsenhausen [19]-[21] in his
seminal work on distributed control systems.

This description allows for arbitrary dependencies between
the observables $X_1, \ldots, X_n$, including cycles of the form
$X_j = f_j(X_i, U_j)$, $X_k = f_k(X_j, U_k)$, $X_i = f_i(X_k, U_i)$. In
order to study causality, we will limit ourselves to sequential
dynamical systems, in which the observables $X_1, \ldots, X_n$ are
ordered in such a way that, for each $i \in [n]$, there exists a
set $\Pi_i \subseteq [i-1]$, such that the function $f_i$ depends essentially
only on $X^{\Pi_i} = (X_j : j \in \Pi_i)$ and on $U_i$:

$$X_i = f_i(X^{\Pi_i}, U_i), \qquad i \in [n]. \tag{3}$$



Moreover, if for each $i$ the exogenous variable $U_i$ is independent
of $(X^{i-1}, U^{i-1})$, then the sequential model (3) specifies
the joint distribution $P_{X^n}$ via the Markov factorization

$$P_{X^n}(x^n) = \prod_{i=1}^{n} P_{X_i|X^{\Pi_i}}(x_i|x^{\Pi_i}), \tag{4}$$

where, for each $i \in [n]$,

$$P_{X_i|X^{\Pi_i}}(x_i|x^{\Pi_i}) = P_{U_i}\big(f_i(x^{\Pi_i}, U_i) = x_i\big) \tag{5}$$

and $X^{[i-1] \setminus \Pi_i} \to X^{\Pi_i} \to X_i$ is a Markov chain. We will
refer to any stochastic dynamical system specified by (3)
with independent disturbances $U_1, \ldots, U_n$ as a Markovian
dynamical system. Apparently, one of the earliest attempts
to study causality by means of simple Markovian models of
this sort was made in the 1920s by the geneticist Sewall
Wright [22].

The Markov factorization (4) can also be represented in
graphical form by means of a directed graph with $n$ vertices,
where vertex $i$ is associated with $X_i$, and there is a directed
edge from vertex $j$ to vertex $i$ if and only if $j \in \Pi_i$. Because
$\Pi_i \subseteq [i-1]$, we end up with a DAG. Since we will use this
graphical representation rather heavily in the sequel, let us
pause to define some concepts associated with DAGs. Given
$i \in [n]$, we let $\Delta_i \subseteq [n]$ denote the set of all descendants of
$i$, i.e., the set of all $j \in [n] \setminus \{i\}$ such that there is a directed
path from $i$ to $j$. Similarly, we let $A_i$ denote the set of all
ancestors of $i$, i.e., all $j \in [n] \setminus \{i\}$ connected to $i$ by directed
paths. We also let $\Delta_i^+ = \Delta_i \cup \{i\}$, so that $N_i = [n] \setminus \Delta_i^+$ is
the set of all nondescendants of $i$. Note that

$$\bigcup_{j \in N_i} A_j \subseteq N_i. \tag{6}$$

Indeed, if for some $j \in N_i$ there exists some $k \in A_j \cap \Delta_i^+$,
then there is a directed path from $i$ to $j$ going through $k$,
which is impossible by the definition of $N_i$.
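The graph-theoretic sets just defined are easy to compute from the parent sets alone. The following is a minimal sketch, assuming a hypothetical six-vertex DAG given as a dictionary of parent sets; the function names are illustrative and not taken from the paper. It also checks the inclusion (6) for every vertex.

```python
def descendants(parents, i):
    """All j != i reachable from i along directed edges."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)
    seen, stack = set(), [i]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def ancestors(parents, i):
    """All j != i from which i can be reached along directed edges."""
    seen, stack = set(), [i]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def nondescendants(parents, i):
    """Complement of the descendants of i together with i itself."""
    return set(parents) - descendants(parents, i) - {i}

# Hypothetical DAG: vertex -> set of direct causes (parents).
parents = {1: set(), 2: set(), 3: {1, 2}, 4: {1}, 5: {3}, 6: {4, 5}}

for i in parents:
    N_i = nondescendants(parents, i)
    # Inclusion (6): ancestors of nondescendants are nondescendants.
    assert all(ancestors(parents, j) <= N_i for j in N_i)
```

For vertex 3 of this assumed graph, the descendants are {5, 6} and the nondescendants are {1, 2, 4}.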

A. Interventions in Markovian dynamical systems 

Consider a Markovian dynamical system specified according
to (3). Just as we did in the simple example of Section II,
we can study the causal effect of one set of variables $X^S$,
$S \subseteq [n]$, on another set $X^T$ with $S \cap T = \emptyset$ by examining the
impact of hard assignments of the form $X^S \leftarrow x^S$ on $X^T$.

The main idea is to start with the recursive representation
(3), delete all equations defining the variables $X_i$, $i \in S$,
and replace all other instances of these variables with the
assigned values. For example, the effect of what Pearl calls
an atomic intervention $X_i \leftarrow x_i$ can be represented as the
following modification of (3):

$$X_j = \begin{cases} f_j(X^{\Pi_j}, U_j)\big|_{X_i = x_i}, & \text{if } j \in \Delta_i \\ f_j(X^{\Pi_j}, U_j), & \text{if } j \in N_i \end{cases} \tag{7}$$

Now, for any set $T \subseteq [n] \setminus \{i\}$, let $P_{X^T|X_i \leftarrow x_i}$ denote
the probability distribution of $X^T$ induced by the modified
model (7). Other notation used by Pearl and coauthors
includes $P_{X^T|\hat{X}_i = \hat{x}_i}$ (where hats are added to the
intervened-upon variables and the values assigned to them)
and $P_{X^T|do(X_i = x_i)}$; we will use some of these interchangeably.
The main claim is that these interventional distributions
describe the causal effect of $X_i$ upon $X^T$. Let us see some
illustrations in support of this claim.

First of all, we would intuitively expect that the intervention
$X_i \leftarrow x_i$ would only affect the descendants of $i$. This
is indeed true:

Lemma 1. For any $T \subseteq N_i$ and any intervention $X_i \leftarrow x_i$,

$$P_{X^T|X_i \leftarrow x_i} = P_{X^T},$$

where the distribution $P_{X^T}$ on the right-hand side is induced
by the original model (3).

Proof. Because of (6), no $X_k$ with $k \in \Delta_i^+$ appears in
any of the equations defining $X^{N_i}$ in (7). Hence, the joint
distribution of $X^{N_i}$ in the modified model (7) is the same
as in the original model (3). □

Since $\Pi_i \subseteq N_i$, we have

Corollary 1. For any $i \in [n]$ and any intervention $X_i \leftarrow x_i$,

$$P_{X^{\Pi_i}|X_i \leftarrow x_i} = P_{X^{\Pi_i}}.$$

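The extension to multiple interventions of the form $X^S \leftarrow x^S$
is immediate: defining the sets

$$\Delta_S \triangleq \bigcup_{i \in S} \Delta_i, \qquad \Delta_S^+ \triangleq \Delta_S \cup S, \qquad N_S \triangleq [n] \setminus \Delta_S^+,$$

we can represent the effect of the intervention $X^S \leftarrow x^S$ on
$X^{S^c} = (X_j : j \notin S)$ by

$$X_j = \begin{cases} f_j(X^{\Pi_j}, U_j)\big|_{X^S = x^S}, & \text{if } j \in \Delta_S \setminus S \\ f_j(X^{\Pi_j}, U_j), & \text{if } j \in N_S \end{cases} \tag{8}$$

and, for any $T \subseteq [n] \setminus S$, the interventional distribution
$P_{X^T|X^S \leftarrow x^S}$ is given by the joint distribution of $X^T$ induced
by (8). Going through the same reasoning as before, we
obtain the following generalization of Lemma 1:

Lemma 2. For any $S \subseteq [n]$, any $T \subseteq N_S$, and any
intervention $X^S \leftarrow x^S$,

$$P_{X^T|X^S \leftarrow x^S} = P_{X^T}.$$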

On the other hand, let us pick some $i \in [n]$ and consider the
causal effect of the intervention $X^{\Pi_i} \leftarrow x^{\Pi_i}$ upon $X_i$:

Lemma 3. For any $S \subseteq [n]$ and any intervention $X^{\Pi_S} \leftarrow x^{\Pi_S}$,
we have

$$P_{X^S|X^{\Pi_S} \leftarrow x^{\Pi_S}} = P_{X^S|X^{\Pi_S} = x^{\Pi_S}}.$$

Moreover, for any $T \subseteq (S \cup \Pi_S)^c$ and any intervention
$X^T \leftarrow x^T$, we have

$$P_{X^S|X^{\Pi_S} \leftarrow x^{\Pi_S},\, X^T \leftarrow x^T} = P_{X^S|X^{\Pi_S} \leftarrow x^{\Pi_S}} = P_{X^S|X^{\Pi_S} = x^{\Pi_S}}.$$

Proof. Observe that, as a result of the intervention $X^{\Pi_S} \leftarrow x^{\Pi_S}$,
we have

$$X_j = f_j(x^{\Pi_j}, U_j), \qquad \forall j \in S,$$

which means that, for any $x^S$ and any additional intervention
$X^T \leftarrow x^T$, where $T$ is disjoint from $S \cup \Pi_S$, we have

$$\begin{aligned}
P_{X^S|X^{\Pi_S} \leftarrow x^{\Pi_S},\, X^T \leftarrow x^T}(x^S) &= P_{U^S}\big(f_j(x^{\Pi_j}, U_j) = x_j,\ \forall j \in S\big) \\
&= P_{X^S|X^{\Pi_S} \leftarrow x^{\Pi_S}}(x^S) \\
&= P_{X^S|X^{\Pi_S} = x^{\Pi_S}}(x^S).
\end{aligned}$$

In other words, the joint distribution of $X^S$ induced by (8)
is unaffected by $X^T$. □

In terms of the Markov factorization (4), we can express
the interventional distributions $P_{X^T|X^S \leftarrow x^S}$ for any pair of
disjoint sets $S, T \subseteq [n]$ as follows. First, we write down the
"global" interventional distribution of $X^{S^c}$ given the action
$X^S \leftarrow x^S$,

$$P_{X^{S^c}|X^S \leftarrow x^S}(x^{S^c}) = \prod_{i \in S^c} P_{X_i|X^{\Pi_i}}(x_i|x^{\Pi_i}), \tag{9}$$

and then marginalize out all variables outside of $T$:

$$P_{X^T|X^S \leftarrow x^S}(x^T) = \sum_{x^{S^c \cap T^c}} P_{X^{S^c}|X^S \leftarrow x^S}(x^{S^c}). \tag{10}$$

Note that, in general, this is different from the ordinary
conditional distribution $P_{X^T|X^S = x^S}$, which has the following
standard interpretation in Bayesian terms: Suppose we can
only observe $X^S$, but not $X^{S^c}$. If we let the system evolve
freely according to (3) and then observe that $X^S = x^S$,
then $P_{X^T|X^S = x^S}$ represents our posterior beliefs about $X^T$
based on the observed evidence $X^S = x^S$.

B. Interventions in graphical models

Graphical model representations of Markovian dynamical
systems offer a convenient visual way of computing
interventional distributions. Essentially, if we wish to write
down the interventional distribution $P_{X^{S^c}|X^S \leftarrow x^S}$, we draw
the corresponding DAG, remove all edges incident upon
(i.e., pointing into) the vertices in $S$, and write down the joint distribution of
$X^{S^c}$ induced by the resulting DAG, while setting $X^S$ to the
assigned values $x^S$.

Let us see this on a couple of examples. Consider the
following graphical model:

[Diagram: a DAG on the vertices $X_1, \ldots, X_6$ with edges $X_1 \to X_3$, $X_2 \to X_3$, $X_1 \to X_4$, $X_3 \to X_5$, $X_4 \to X_6$, $X_5 \to X_6$.]



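It specifies the joint distribution of $X^6 = (X_1, \ldots, X_6)$ via

$$\begin{aligned}
P_{X^6}(x^6) = {} & P_{X_1}(x_1)\,P_{X_2}(x_2) \\
& \times P_{X_3|X^2}(x_3|x^2)\,P_{X_4|X_1}(x_4|x_1) \\
& \times P_{X_5|X_3}(x_5|x_3)\,P_{X_6|X_4^5}(x_6|x_4^5).
\end{aligned}$$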



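The effect of the intervention $X_3 \leftarrow x_3$ can be represented
graphically as follows:

[Diagram: the same DAG with the edges $X_1 \to X_3$ and $X_2 \to X_3$ removed, the vertex $X_3$ enclosed in a box, and an arrow from the assigned value $x_3$ into $X_3$.]

In other words, the intervened-upon variable $X_3$, which is
enclosed in a box, is disconnected from its direct causes in
$\Pi_3 = \{1, 2\}$, and an additional arrow is added to indicate
the hard assignment $X_3 \leftarrow x_3$. The resulting interventional
distribution $P_{X^2, X_4^6|X_3 \leftarrow x_3}$ can be read off directly from the
diagram:

$$\begin{aligned}
P_{X^2, X_4^6|X_3 \leftarrow x_3}(x^2, x_4^6) = {} & P_{X_1}(x_1)\,P_{X_2}(x_2) \\
& \times P_{X_4|X_1}(x_4|x_1)\,P_{X_5|X_3}(x_5|x_3) \\
& \times P_{X_6|X_4^5}(x_6|x_4^5).
\end{aligned}$$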
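The recipe "drop the factors of the intervened-upon variables, fix their values, and sum out everything outside T" is mechanical enough to sketch in code. The example below assumes binary variables and made-up conditional probability tables on the six-vertex graph described above; the names `PARENTS`, `CPTS`, `bern`, and `interventional` are illustrative only and are not part of the paper.

```python
from itertools import product

def bern(p):
    """Distribution (P(0), P(1)) of a binary variable with P(1) = p."""
    return (1.0 - p, p)

# Direct causes of each vertex, in a fixed order (assumed example graph).
PARENTS = {1: (), 2: (), 3: (1, 2), 4: (1,), 5: (3,), 6: (4, 5)}

# Made-up conditional probability tables, indexed by parent values.
CPTS = {
    1: {(): bern(0.3)},
    2: {(): bern(0.6)},
    3: {(a, b): bern(0.1 + 0.4 * a + 0.3 * b)
        for a, b in product((0, 1), repeat=2)},
    4: {(a,): bern(0.2 + 0.5 * a) for a in (0, 1)},
    5: {(a,): bern(0.25 + 0.5 * a) for a in (0, 1)},
    6: {(a, b): bern(0.1 + 0.3 * a + 0.5 * b)
        for a, b in product((0, 1), repeat=2)},
}

def interventional(assigned, targets):
    """P(X^T | X^S <- x^S) via the truncated factorization.

    assigned: dict {vertex: forced value} for the set S
    targets:  tuple of vertices in T (disjoint from S)
    """
    free = [i for i in PARENTS if i not in assigned]
    dist = {}
    for values in product((0, 1), repeat=len(free)):
        x = dict(zip(free, values))
        x.update(assigned)                      # hard assignment of x^S
        weight = 1.0
        for i in free:                          # factors of S are dropped
            pa = tuple(x[j] for j in PARENTS[i])
            weight *= CPTS[i][pa][x[i]]
        key = tuple(x[t] for t in targets)      # marginalize onto T
        dist[key] = dist.get(key, 0.0) + weight
    return dist

print(interventional({3: 1}, (5,)))  # law of X_5 under X_3 <- 1
print(interventional({3: 1}, (1,)))  # law of X_1 is left at its marginal
```

With these assumed tables, forcing X_3 to 1 gives X_5 the law prescribed by its own table at x_3 = 1, while X_1 (a nondescendant of X_3) keeps its original marginal, in line with Lemma 1.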

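As another example, consider the following diagram, which
depicts communication over a discrete memoryless channel
$P_{Y|X}$ using a sequence of possibly randomized feedback
encoders $P_{X_i|X^{i-1}, Y^{i-1}}$, $i \in [n]$:

[Diagram: DAG on the channel inputs $X_1, X_2, \ldots, X_n$ and the channel outputs $Y_1, Y_2, Y_3, \ldots, Y_n$ for communication with feedback.]

The effect of the intervention $Y_1 \leftarrow y_1, \ldots, Y_n \leftarrow y_n$ is
represented graphically as

[Diagram: the same DAG after the intervention, with the outputs replaced by the assigned values $y_1, y_2, y_3, \ldots, y_n$.]

and the corresponding interventional distribution is

$$P_{X^n|Y^n \leftarrow y^n}(x^n) = \prod_{i=1}^{n} P_{X_i|X^{i-1}, Y^{i-1}}(x_i|x^{i-1}, y^{i-1}).$$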

C. Interventional distributions as directed stochastic kernels 

As it turns out, Pearl's construction of interventional distributions
has been developed independently by Tatikonda and
Mitter [9], [10] under the name of directed stochastic kernels
in their work on the capacity of channels with feedback.

Tatikonda and Mitter consider an $n$-tuple of causally
ordered random variables $X_1, \ldots, X_n$ with joint distribution

$$P_{X^n}(x^n) = \prod_{i=1}^{n} P_{X_i|X^{i-1}}(x_i|x^{i-1})$$

(of course, we are free to factor $P_{X^n}$ along any other
ordering of the variables, but the subsequent definitions
depend on a fixed ordering). Then for any $S \subseteq [n]$ they
define the directed stochastic kernel $\vec{P}_{X^{S^c}|X^S = x^S}$ by

$$\vec{P}_{X^{S^c}|X^S = x^S}(x^{S^c}) \triangleq \prod_{i \in S^c} P_{X_i|X^{i-1}}(x_i|x^{i-1}). \tag{11}$$

It is easy to see that this definition is equivalent to Pearl's.
Indeed, if we consider the DAG with $n$ vertices that has
$\Pi_i = [i-1]$ for each $i \in [n]$, then $\vec{P}_{X^{S^c}|X^S = x^S}$ defined
in (11) is equal to $P_{X^{S^c}|X^S \leftarrow x^S}$ defined in (9). Conversely,
if the variables $X_1, \ldots, X_n$ are ordered in such a way that
for each $i \in [n]$ there exists some $\Pi_i \subseteq [i-1]$ such that
$X^{[i-1] \setminus \Pi_i} \to X^{\Pi_i} \to X_i$ is a Markov chain, then

$$\begin{aligned}
P_{X^{S^c}|X^S \leftarrow x^S}(x^{S^c}) &= \prod_{i \in S^c} P_{X_i|X^{\Pi_i}}(x_i|x^{\Pi_i}) \\
&= \prod_{i \in S^c} P_{X_i|X^{i-1}}(x_i|x^{i-1}) \\
&= \vec{P}_{X^{S^c}|X^S = x^S}(x^{S^c}),
\end{aligned}$$

where the first step uses (9), the second uses the above
Markov chain condition, and the third uses (11).

D. Interventions as channels 

The interventional distribution $P_{X^T|X^S \leftarrow x^S}$ can be viewed
as a mapping from the set of all tuples $x^S = (x_i : i \in S)$
into the set of all probability distributions for $X^T$. Any
such mapping defines a channel [23] with input variable
$X^S$ and output variable $X^T$. If $S = \Pi_T$, then Lemma 3
shows that this channel coincides with the specification
of the conditional distribution of $X^T$ given $X^{\Pi_T}$ in the
intervention-free system. This equality of the originally prescribed
stochastic kernels and the directed stochastic kernels
holds whenever $X^S$ (resp., $X^T$) is the complete input (resp.,
output) variable of an encoder, decoder, or controller. By
contrast, whenever $P_{X^T|X^S \leftarrow x^S} \neq P_{X^T|X^S = x^S}$ for some
$x^S$, we can conclude that there are some additional causal
or statistical relationships between $X^S$ and $X^T$.

IV. Directed information as a measure of causality

Now that we have motivated the notion of a causal
effect, we can proceed to define various information-theoretic
quantities that capture causality as opposed to dependence.
Assuming, as before, a Markovian dynamical system of
the form (3), let us consider the interventional distribution
$P_{X^T|\hat{X}^S}(\cdot|x^S)$ for disjoint sets $S, T \subseteq [n]$. As we
have pointed out already, this distribution is, in general,
different from the conditional distribution $P_{X^T|X^S}(\cdot|x^S)$. In
particular, if $P_{X^T|\hat{X}^S}(\cdot|x^S) = P_{X^T}(\cdot)$ for any intervention
$X^S \leftarrow x^S$, then the variables in $S$ have no causal influence
on those in $T$. On the opposite end of the spectrum, if
$P_{X^T|\hat{X}^S}(\cdot|x^S) = P_{X^T|X^S}(\cdot|x^S)$, then the causal effect of
$X^S$ on $X^T$ coincides with ordinary conditioning. This observation
suggests that, for each realization $x^S$ of $X^S$, we may
measure the average "strength" of the causal effect of the
intervention $X^S \leftarrow x^S$ on $X^T$ by the divergence

$$D\big(P_{X^T|X^S = x^S} \,\big\|\, P_{X^T|\hat{X}^S = x^S}\big) = \mathbb{E}\left[\log \frac{P_{X^T|X^S}(X^T|x^S)}{P_{X^T|\hat{X}^S}(X^T|x^S)}\right],$$

where the expectation is w.r.t. the conditional distribution
$P_{X^T|X^S = x^S}$. If we now average this w.r.t. the marginal
distribution of $X^S$ induced by (3), then we obtain

$$D\big(P_{X^T|X^S} \,\big\|\, P_{X^T|\hat{X}^S} \,\big|\, P_{X^S}\big) = \mathbb{E}\left[\log \frac{P_{X^T|X^S}(X^T|X^S)}{P_{X^T|\hat{X}^S}(X^T|X^S)}\right], \tag{12}$$

where $D(P_{B|A} \| Q_{B|A} | P_A)$ denotes the conditional divergence
[24]. If $T = S^c$, then we have

$$\begin{aligned}
D\big(P_{X^{S^c}|X^S} \,\big\|\, P_{X^{S^c}|\hat{X}^S} \,\big|\, P_{X^S}\big)
&= \mathbb{E}\left[\log \frac{P_{X^{S^c}|X^S}(X^{S^c}|X^S)}{P_{X^{S^c}|\hat{X}^S}(X^{S^c}|X^S)}\right] \\
&= \mathbb{E}\left[\log \frac{P_{X^{S^c}|X^S}(X^{S^c}|X^S)}{\vec{P}_{X^{S^c}|X^S}(X^{S^c}|X^S)}\right],
\end{aligned}$$

where the second step uses the equivalence between the interventional
distribution $P_{X^{S^c}|\hat{X}^S}$ and the directed stochastic
kernel $\vec{P}_{X^{S^c}|X^S}$. We can now recognize the last expression
as the directed information $I(X^{S^c} \to X^S)$ from $X^{S^c}$ to
$X^S$ as defined by Tatikonda and Mitter [10, p. 327]. This
definition, in turn, generalizes the one proposed by Massey
[7] in the context of communication over noisy channels
with feedback. Thus, directed information arises naturally as
an information-theoretic measure of causality: if $I(X^{S^c} \to X^S)$
is small, then the interventional distributions of $X^{S^c}$
based on $X^S$ are close to observational (i.e., conditional)
distributions of $X^{S^c}$ given $X^S$, which means that the causal
effects of $X^S$ on $X^{S^c}$ can be reliably identified without
the need for active experimentation. On the other hand, if
$I(X^{S^c} \to X^S)$ is equal to the ordinary mutual information
$I(X^{S^c}; X^S)$, then the variables in $S$ have no causal effect on
the remaining variables in $S^c$, and any statistical dependence
between $X^S$ and $X^{S^c}$ must be along the (not necessarily
directed) paths in the DAG that have some edges pointing
toward $S$.

The definitions of Massey and Tatikonda-Mitter apply
only to the causal effect of $X^S$ on the entire complementary
set $X^{S^c}$. We can, however, consider an arbitrary set $T \subseteq S^c$
and use (12) as our definition of the directed information
from $X^T$ to $X^S$:

$$I(X^T \to X^S) \triangleq D\big(P_{X^T|X^S} \,\big\|\, P_{X^T|\hat{X}^S} \,\big|\, P_{X^S}\big). \tag{13}$$

Note that for $I(X^T \to X^S)$ to be well-defined, we need
to specify an appropriate Markovian dynamical system,
where the interventional distribution $P_{X^T|\hat{X}^S}$ is computed
according to (10).
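Definition (13) can be evaluated by brute force on a small model. The sketch below assumes a hypothetical three-variable binary system with vertices 1 → 2, 1 → 3, 2 → 3 and made-up conditional probability tables; it computes the observational conditional from the Markov factorization and the interventional distribution from the truncated factorization (9)-(10). All names are illustrative, and information is measured in nats.

```python
from itertools import product
from math import log

def bern(p):
    return (1.0 - p, p)

ORDER = (1, 2, 3)
PARENTS = {1: (), 2: (1,), 3: (1, 2)}     # assumed example graph
CPTS = {                                   # made-up tables
    1: {(): bern(0.4)},
    2: {(a,): bern(0.2 + 0.6 * a) for a in (0, 1)},
    3: {(a, b): bern(0.1 + 0.3 * a + 0.5 * b)
        for a, b in product((0, 1), repeat=2)},
}

def factors(x, skip=()):
    """Product of the factors P(x_i | parents), omitting vertices in skip."""
    p = 1.0
    for i in ORDER:
        if i not in skip:
            p *= CPTS[i][tuple(x[j] for j in PARENTS[i])][x[i]]
    return p

def tables(T, S):
    """Joint of (X^T, X^S), marginals, and the interventional table."""
    p_S, p_T, p_TS, p_do = {}, {}, {}, {}
    for values in product((0, 1), repeat=len(ORDER)):
        x = dict(zip(ORDER, values))
        xt = tuple(x[i] for i in T)
        xs = tuple(x[i] for i in S)
        p = factors(x)
        p_S[xs] = p_S.get(xs, 0.0) + p
        p_T[xt] = p_T.get(xt, 0.0) + p
        p_TS[xt, xs] = p_TS.get((xt, xs), 0.0) + p
        # truncated factorization: factors of S removed, x^S held fixed
        p_do[xt, xs] = p_do.get((xt, xs), 0.0) + factors(x, skip=S)
    return p_S, p_T, p_TS, p_do

def directed_information(T, S):
    """I(X^T -> X^S) as the conditional divergence in (13)."""
    p_S, _, p_TS, p_do = tables(T, S)
    return sum(p * log(p / p_S[xs] / p_do[xt, xs])
               for (xt, xs), p in p_TS.items() if p > 0)

def mutual_information(T, S):
    p_S, p_T, p_TS, _ = tables(T, S)
    return sum(p * log(p / (p_S[xs] * p_T[xt]))
               for (xt, xs), p in p_TS.items() if p > 0)

print(directed_information((1,), (2,)), mutual_information((1,), (2,)))
print(directed_information((2,), (1,)))
print(directed_information((3,), (2,)))
```

In this assumed model, vertex 1 is a nondescendant of vertex 2, so by Lemma 2 the interventional distribution of X_1 is its marginal and the first line prints two equal numbers; vertex 1 is the only direct cause of vertex 2, so the second quantity vanishes; and the third is strictly positive because vertex 1 influences both vertices 2 and 3.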

An expression for the directed information $I(X^{S^c} \to X^S)$
can be obtained from the underlying graphical model. Indeed,
note that we can write

$$I(X^{S^c} \to X^S) = \mathbb{E}\left[\log \frac{P_{X^S, X^{S^c}}(X^S, X^{S^c})}{P_{X^{S^c}|\hat{X}^S}(X^{S^c}|X^S)\,P_{X^S}(X^S)}\right].$$

Now, the probability distribution in the numerator is equal
to $P_{X^n}$ and can be assembled from the original Markov
factorization, while the one in the denominator is the product
of the interventional distribution $P_{X^{S^c}|\hat{X}^S}$ (which can be
read off from the transformed DAG obtained using the
procedure illustrated in Section III-B) and the marginal
distribution $P_{X^S}$ according to the original model. The directed
edges that are common to the original DAG and the
transformed DAG correspond to the factors in the numerator
and the denominator that can be cancelled. The remaining
expression can then be represented as a sum of conditional
mutual informations by exploiting appropriate conditional
independence relations encoded in the original DAG.$^1$

A. Combining interventions and passive observations: conditional directed information

We have already pointed out the different status of active
interventions of the form $X^S \leftarrow x^S$ and conditioning on
passive observations $X^S = x^S$. Many problems pertaining
to causality involve a combination of the two: given three
disjoint sets $S, S', T \subseteq [n]$, we may want to consider a mixed
quantity $P_{X^T|X^S \leftarrow x^S,\, X^{S'} = x^{S'}}$. In order for such an object to
be well-defined, the conditioning on $X^{S'}$ must be done w.r.t.
the interventional distribution $P_{X^{S' \cup T}|X^S \leftarrow x^S}$:

$$P_{X^T|X^S \leftarrow x^S,\, X^{S'} = x^{S'}}(x^T) \triangleq \frac{P_{X^{S' \cup T}|X^S \leftarrow x^S}(x^{S' \cup T})}{P_{X^{S'}|X^S \leftarrow x^S}(x^{S'})}.$$

In fact, this is the only sensible definition, because performing
the conditioning first may destroy the Markov structures
that are needed to construct the interventional distribution.

With the above definition, we may define the conditional
directed information

$$\begin{aligned}
I(X^T \to X^S | X^{S'}) &\triangleq D\big(P_{X^T|X^S, X^{S'}} \,\big\|\, P_{X^T|\hat{X}^S, X^{S'}} \,\big|\, P_{X^S, X^{S'}}\big) && \text{(14a)} \\
&= \mathbb{E}\left[\log \frac{P_{X^T|X^S, X^{S'}}(X^T|X^S, X^{S'})}{P_{X^T|\hat{X}^S, X^{S'}}(X^T|X^S, X^{S'})}\right]. && \text{(14b)}
\end{aligned}$$

B. Some properties of directed information 

Let us illustrate the role of the directed information (13)
and the conditional directed information (14) in quantifying
the causal flow of information in Markovian dynamical
systems. We start with the following:

Lemma 4. For any $S \subseteq [n]$ and any $T \subseteq N_S$,

$$I(X^T \to X^S) = I(X^T; X^S).$$

Moreover, for any $T \subseteq (S \cup \Pi_S)^c$,

$$I(X^S \to X^{\Pi_S \cup T}) = I(X^S \to X^{\Pi_S}) = 0.$$

Proof. This is just a restatement of Lemmas 2 and 3 in the
language of directed information. □

1 We would like to thank Yury Polyanskiy for clarifications regarding this
procedure.

We can also show that there are two contributions to
the directed flow of information from $X^T$ to $X^S$: (1) the
ordinary mutual information between the variables in $S$ and
any nondescendants of $S$ that happen to lie in $T$, and (2) the
conditional directed information from the descendants of $S$
in $T$ to $S$, given the nondescendants of $S$ in $T$:

Proposition 1 (chain rule). For any two disjoint sets $S, T \subseteq [n]$,
we have

$$I(X^T \to X^S) = I(X^{T \cap N_S}; X^S) + I(X^{T \cap \Delta_S} \to X^S | X^{T \cap N_S}). \tag{15}$$

Proof. For brevity, let us denote $T_1 = T \cap N_S$ and $T_2 = T \cap \Delta_S^+$
(which is equal to $T \cap \Delta_S$ since $S \cap T = \emptyset$). Then

$$\begin{aligned}
P_{X^T|X^S \leftarrow x^S}(x^T) &= P_{X^{T_1}|X^S \leftarrow x^S}(x^{T_1})\,P_{X^{T_2}|X^S \leftarrow x^S,\, X^{T_1} = x^{T_1}}(x^{T_2}) \\
&= P_{X^{T_1}}(x^{T_1})\,P_{X^{T_2}|X^S \leftarrow x^S,\, X^{T_1} = x^{T_1}}(x^{T_2}),
\end{aligned}$$

where the second step uses Lemma 2. Similarly,

$$P_{X^T|X^S = x^S}(x^T) = P_{X^{T_1}|X^S = x^S}(x^{T_1})\,P_{X^{T_2}|X^S = x^S,\, X^{T_1} = x^{T_1}}(x^{T_2}).$$

Therefore,

$$\begin{aligned}
I(X^T \to X^S) &= \mathbb{E}\left[\log \frac{P_{X^{T_1}|X^S}(X^{T_1}|X^S)}{P_{X^{T_1}}(X^{T_1})}\right] + \mathbb{E}\left[\log \frac{P_{X^{T_2}|X^S, X^{T_1}}(X^{T_2}|X^S, X^{T_1})}{P_{X^{T_2}|\hat{X}^S, X^{T_1}}(X^{T_2}|X^S, X^{T_1})}\right] \\
&= I(X^{T_1}; X^S) + I(X^{T_2} \to X^S | X^{T_1}),
\end{aligned}$$

which gives us (15). □

Corollary 2. For any set $S \subseteq [n]$,

$$I(X^{S^c} \to X^S) = I(X^{N_S}; X^S) + I(X^{\Delta_S} \to X^S | X^{N_S}).$$

Proof. Immediate from the proposition with $T = S^c$. □

C. Examples: three canonical causal structures 

Many fundamental questions pertaining to causality
(including the possibility of discovering causal influences
from observational data) can be reduced to the study of three
canonical causal structures involving three random variables
$X, Y, Z$: the chain $X \to Y \to Z$; the fork $X \leftarrow Y \to Z$;
and the collider $X \to Y \leftarrow Z$ [16], [18]. We have the
following examples of directed information relations for
these structures:

Chain. Since $X$ is a nondescendant of $Y$, we have
$I(X \to Y) = I(X; Y)$; since $X$ is the direct cause
of $Y$ we have $I(Y \to X) = 0$. Similarly, we have
$I(Y \to Z) = I(Y; Z)$ and $I(Z \to Y) = 0$. Moreover, since
$X$ is a nondescendant of $Z$, we have $I(X \to Z) = I(X; Z)$.
On the other hand, $I(Z \to X) = 0$.

Fork. $Y$ is the direct cause of $X$, so $I(X \to Y) = 0$, and it
is a nondescendant of $X$, so $I(Y \to X) = I(X; Y)$.
The same goes for $Y$ and $Z$: $I(Z \to Y) = 0$
and $I(Y \to Z) = I(Y; Z)$. Finally, we have
$I(X \to Z) = I(Z \to X) = I(X; Z)$, since there is
no directed path from $X$ to $Z$ or from $Z$ to $X$.

Collider. The direction of the links between $X$ and $Y$ and
between $Z$ and $Y$ is the reverse of that in the fork, so we
have $I(X \to Y) = I(X; Y)$, $I(Y \to X) = 0$, $I(Y \to Z) = 0$,
and $I(Z \to Y) = I(Y; Z)$. Finally, since $X$ is a
nondescendant of $Z$, we have $I(X \to Z) = I(X; Z) = 0$;
similarly, $I(Z \to X) = I(X; Z) = 0$, where we have also
used the fact that $X$ and $Z$ are independent.

V. Application to identification of causal effects

One active area of interest in the studies of causality
concerns identification of causal effects based on passive
observations only. In the context of Markovian dynamical
system models, this problem arises whenever only a subset
of the variables $X^n$ is available for observation, the goal is
to determine the causal effect of one group of variables in
this subset upon another, and it is not possible or feasible to
actively intervene into the system. Then the relevant question
becomes: given a set $V \subseteq [n]$ that indexes the variables
available for observation, is it possible to express the causal
effect $P_{X^T|\hat{X}^S}$ for some disjoint sets $S, T \subseteq V$ in terms of
ordinary (noninterventional) probabilities?

More precisely, let us assume that we know the structure
of the underlying DAG (i.e., the sets $\Pi_i$, $i \in [n]$), but not
the functions $f_i$ or the distributions $P_{U_i}$ of the exogenous
disturbances. What other variables besides those in $S$ and $T$
do we need to observe in order to estimate the causal effect
$P_{X^T|\hat{X}^S}$? The idea is that the ordinary conditional probabilities
relating the variables in $V$ can be estimated from passive
observations, and so $P_{X^T|\hat{X}^S}$ can be estimated using a plug-in
rule in terms of these conditional probabilities.

One obvious answer is that it is sufficient to observe $S$, $T$,
and all direct causes of the variables in $S$, i.e., those in $\Pi_S$.
To see this, let us write down the interventional distribution
$P_{X^T|X^S \leftarrow x^S}$ and condition on $X^{\Pi_S}$:

$$\begin{aligned}
P_{X^T|X^S \leftarrow x^S}(x^T) &= \sum_{x^{\Pi_S}} P_{X^T|X^S \leftarrow x^S,\, X^{\Pi_S} = x^{\Pi_S}}(x^T)\,P_{X^{\Pi_S}|X^S \leftarrow x^S}(x^{\Pi_S}) \\
&= \sum_{x^{\Pi_S}} P_{X^T|X^S \leftarrow x^S,\, X^{\Pi_S} = x^{\Pi_S}}(x^T)\,P_{X^{\Pi_S}}(x^{\Pi_S}),
\end{aligned}$$

where the second step uses the fact that $\Pi_S \subseteq N_S$
and Lemma 2. Now, it can be shown that
$P_{X^T|X^S \leftarrow x^S,\, X^{\Pi_S} = x^{\Pi_S}} = P_{X^T|X^S = x^S,\, X^{\Pi_S} = x^{\Pi_S}}$ [18,
Thm. 3.2.2], which is equivalent to $I(X^T \to X^S | X^{\Pi_S}) = 0$.
This gives

$$P_{X^T|X^S \leftarrow x^S}(x^T) = \sum_{x^{\Pi_S}} P_{X^T|X^S = x^S,\, X^{\Pi_S} = x^{\Pi_S}}(x^T)\,P_{X^{\Pi_S}}(x^{\Pi_S}). \tag{16}$$

x n s 

Thus, if we observe the variables in $T$, $S$, and $\Pi_S$, then
we can use (16) to develop an estimate of the causal effect
$P_{X^T|\hat{X}^S}$ in terms of the conditional distribution $P_{X^T|X^S, X^{\Pi_S}}$
and the marginal distribution $P_{X^{\Pi_S}}$. Both of these quantities
can, in turn, be estimated from passive observations. The
intuitive meaning of (16) is that we can estimate the causal
effect of $X^S$ on $X^T$ without any need for active experimentation
if we can control for the direct causes of $X^S$,
i.e., $X^{\Pi_S}$. Whenever this is not possible, we would still
like to know what other variables it suffices to observe in
order for the causal effect $P_{X^T|\hat{X}^S}$ to be identifiable. One
sufficient condition due to Pearl, who termed it the "back-door
criterion" [18, Sec. 3.3.1], says that certain subsets of
the nondescendants of $S$ can be used instead:

Theorem 1 (the back-door criterion: directed information
form). Let $S, T \subseteq [n]$ be such that $T$ is disjoint from $S \cup \Pi_S$.
Then for any set $Z \subseteq N_S$ the relation

$$P_{X^T|X^S \leftarrow x^S}(x^T) = \sum_{x^Z} P_{X^T|X^{S \cup Z} = x^{S \cup Z}}(x^T)\,P_{X^Z}(x^Z) \tag{17}$$

holds if and only if $I(X^T \to X^S | X^Z) = 0$.

Proof. Let us condition on $X^Z$:

$$\begin{aligned}
P_{X^T|X^S \leftarrow x^S}(x^T) &= \sum_{x^Z} P_{X^T|X^S \leftarrow x^S,\, X^Z = x^Z}(x^T)\,P_{X^Z|X^S \leftarrow x^S}(x^Z) \\
&= \sum_{x^Z} P_{X^T|X^S \leftarrow x^S,\, X^Z = x^Z}(x^T)\,P_{X^Z}(x^Z),
\end{aligned}$$

where the second step uses the fact that $Z \subseteq N_S$
and Lemma 2. The proof is finished using the fact that
$P_{X^T|X^S \leftarrow x^S,\, X^Z = x^Z} = P_{X^T|X^{S \cup Z} = x^{S \cup Z}}$ for all $x^S, x^Z$ if
and only if $I(X^T \to X^S | X^Z) = 0$. □

The original back-door criterion [18, Section 3.3.1] is stated
in graphical terms using the notion of d-separation (a graph-based
criterion for identifying conditional independence relations),
so it can be checked without knowing $\{f_i\}_{i=1}^n$ or
$\{P_{U_i}\}_{i=1}^n$. Conceptually, its equivalent information-theoretic
form given by the above theorem is similar to statistical sufficiency:
if $Z \subseteq N_S$, then $X^Z$ may only depend functionally
on $X^T$ (but not on $X^S$ or on any of the descendants of $X^S$),
and if $I(X^S; X^T | X^Z) = 0$, then $X^Z$ is sufficient for $X^S$ in
the ordinary Bayesian sense.
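The adjustment formula (17) can be checked numerically on a small confounded system. The sketch below assumes a hypothetical binary model with a common cause Z of X and Y and a direct edge from X to Y, with made-up probability tables; here S = {X}, T = {Y}, and the adjustment set is {Z}. The function names are illustrative only.

```python
from itertools import product

def bern(p):
    return (1.0 - p, p)

# Made-up model: Z -> X, Z -> Y, X -> Y.
P_Z = bern(0.4)
P_X_given_Z = {z: bern(0.2 + 0.6 * z) for z in (0, 1)}
P_Y_given_ZX = {(z, x): bern(0.1 + 0.3 * z + 0.5 * x)
                for z, x in product((0, 1), repeat=2)}

def joint(z, x, y):
    """Observational joint distribution from the Markov factorization."""
    return P_Z[z] * P_X_given_Z[z][x] * P_Y_given_ZX[z, x][y]

def p_do(y, x):
    """P(Y = y | X <- x) from the truncated factorization (factor of X dropped)."""
    return sum(P_Z[z] * P_Y_given_ZX[z, x][y] for z in (0, 1))

def p_adjust(y, x):
    """Right-hand side of (17), built only from observational probabilities."""
    total = 0.0
    for z in (0, 1):
        p_zx = sum(joint(z, x, yy) for yy in (0, 1))
        p_z = sum(joint(z, xx, yy) for xx in (0, 1) for yy in (0, 1))
        total += joint(z, x, y) / p_zx * p_z
    return total

def p_cond(y, x):
    """Ordinary conditional P(Y = y | X = x), without adjustment."""
    num = sum(joint(z, x, y) for z in (0, 1))
    den = sum(joint(z, x, yy) for z in (0, 1) for yy in (0, 1))
    return num / den

print(p_do(1, 1), p_adjust(1, 1), p_cond(1, 1))
```

In this assumed model the adjusted observational quantity reproduces the interventional probability, while the unadjusted conditional probability differs from it because of the common cause.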